Assignment 3 - Building a Custom Visualization

In this assignment you must choose one of the options presented below and submit a visual as well as your source code for peer grading. The details of how you solve the assignment are up to you, although your assignment must use matplotlib so that your peers can evaluate your work. The options differ in challenge level, but there are no grades associated with the challenge level you choose. However, your peers will be asked to ensure you at least met a minimum quality for a given technique in order to pass. Implement the technique fully (or exceed it!) and you should be able to earn full grades for the assignment.

Ferreira, N., Fisher, D., & Konig, A. C. (2014, April). Sample-oriented task-driven visualizations: allowing users to make better, more confident decisions. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 571-580). ACM.

In this paper the authors describe the challenges users face when trying to make judgements about probabilistic data generated through samples. As an example, they look at a bar chart of four years of data (replicated below in Figure 1). Each year has a y-axis value, which is derived from a sample of a larger dataset. For instance, the first value might be the number of votes in a given district or riding for 1992, with the average being around 33,000. On top of this is plotted the 95% confidence interval for the mean, the range of vote counts that is expected to contain the true average with 95% confidence (see the boxplot lectures for more information, and the yerr parameter of bar charts).
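The interval described above can be computed in a few lines. The following is a minimal sketch, assuming the error bar is the normal-approximation 95% confidence interval for the sample mean. The helper name `mean_and_ci95` and the generated sample are illustrative and not part of the assignment, and only the standard library is used so the snippet runs on its own.

```python
import math
import random
import statistics


def mean_and_ci95(sample):
    """Return (mean, half_width) of a 95% confidence interval for the mean."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Standard error of the mean, using the sample standard deviation.
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided 95% critical value of the standard normal (about 1.96).
    z_critical = statistics.NormalDist().inv_cdf(0.975)
    return mean, z_critical * std_err


# Illustrative sample only: 3650 draws from a normal distribution.
rng = random.Random(12345)
sample = [rng.gauss(33500, 150000) for _ in range(3650)]
mean, half_width = mean_and_ci95(sample)
print("95% CI for the mean:", mean - half_width, "to", mean + half_width)
```

The half-width returned here is the kind of value the yerr parameter of a matplotlib bar chart expects.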

Figure 1 from (Ferreira et al., 2014).

A challenge that users face is that, for a given y-axis value (e.g. 42,000), it is difficult to know which x-axis values are most likely to be representative, because the confidence intervals overlap and their distributions are different (the lengths of the confidence interval bars are unequal). One of the solutions the authors propose for this problem (Figure 2c) is to allow users to indicate the y-axis value of interest (e.g. 42,000) and then draw a horizontal line and color bars based on this value. So bars might be colored red if they are definitely above this value (given the confidence interval), blue if they are definitely below this value, or white if they contain this value.

Figure 2c from (Ferreira et al., 2014). Note that the colorbar legend at the bottom as well as the arrows are not required in the assignment descriptions below.

Easiest option: Implement the bar coloring as described above, using a color scale with only three colors (e.g. blue, white, and red). Assume the user provides the y-axis value of interest as a parameter or variable.
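One way to express the three-color rule is as a small function that compares the chosen value with each bar's confidence interval. This is only a sketch: the name `classify_bar` and the numbers in the example are hypothetical and do not come from the assignment data.

```python
def classify_bar(mean, half_width, y_value):
    """Pick a color by comparing y_value with the interval mean +/- half_width."""
    lower, upper = mean - half_width, mean + half_width
    if lower > y_value:
        return 'red'    # the whole interval is above the value
    if upper < y_value:
        return 'blue'   # the whole interval is below the value
    return 'white'      # the interval contains the value


# Hypothetical means and interval half-widths for four years.
means = [33000, 41000, 39500, 48000]
half_widths = [4800, 2900, 3900, 1800]
colors = [classify_bar(m, h, 42000) for m, h in zip(means, half_widths)]
print(colors)  # ['blue', 'white', 'white', 'red']
```

A list like `colors` can be passed as the color argument of a matplotlib bar chart, one entry per bar.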

Harder option: Implement the bar coloring as described in the paper, where the color of the bar is actually based on the amount of data covered (e.g. a gradient ranging from dark blue when the distribution is certainly below this y-axis value, through white when the value is certainly contained, to dark red when the distribution is certainly above it).
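A gradient of this kind needs a number that says how much of a bar's distribution lies below the chosen value. The sketch below is one possible reading, not the paper's exact method: it assumes the sampling distribution of each mean is normal with a positive standard error, and it blends between three hand-picked RGB colors. The names `gradient_color` and `_blend` are hypothetical.

```python
from statistics import NormalDist

# Hand-picked end points of the gradient, as RGB tuples in the 0-1 range.
DARK_BLUE = (0.0, 0.0, 0.55)
WHITE = (1.0, 1.0, 1.0)
DARK_RED = (0.55, 0.0, 0.0)


def _blend(start, end, t):
    """Linearly interpolate between two RGB tuples, with t in [0, 1]."""
    return tuple(s * (1.0 - t) + e * t for s, e in zip(start, end))


def gradient_color(mean, std_err, y_value):
    """Return an RGB color based on the share of the distribution below y_value."""
    share_below = NormalDist(mean, std_err).cdf(y_value)
    if share_below >= 0.5:
        # Most of the distribution is below the value: shade toward dark blue.
        return _blend(WHITE, DARK_BLUE, (share_below - 0.5) * 2.0)
    # Most of the distribution is above the value: shade toward dark red.
    return _blend(WHITE, DARK_RED, (0.5 - share_below) * 2.0)
```

matplotlib accepts RGB tuples like these wherever a color is expected, so the result can be used directly as a bar color. A built-in diverging colormap would be a reasonable alternative to the manual blend.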

Even Harder option: Add interactivity to the above, which allows the user to click on the y-axis to set the value of interest. The bar colors should change with respect to what value the user has selected.

Hardest option: Allow the user to interactively set a range of y values they are interested in, and recolor based on this (e.g. a y-axis band, see the paper for more details).
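For the range-based option, matplotlib's `SpanSelector` widget is one way to let the user drag out a band on the y-axis. The sketch below is an illustration under stated assumptions rather than the paper's method: the per-year means and standard errors are hypothetical, each mean is treated as normally distributed, and bars are shaded by the share of that distribution falling inside the selected band. The names `band_coverage` and `on_select` are made up for this example.

```python
from statistics import NormalDist

import matplotlib.pyplot as plt
from matplotlib.widgets import SpanSelector

# Hypothetical per-year summaries: mean and standard error of the mean.
years = [1992, 1993, 1994, 1995]
means = [33000.0, 41000.0, 39000.0, 48000.0]
std_errs = [2500.0, 1500.0, 2000.0, 900.0]


def band_coverage(mean, std_err, y_low, y_high):
    """Share of a normal distribution that falls between y_low and y_high."""
    dist = NormalDist(mean, std_err)
    return dist.cdf(y_high) - dist.cdf(y_low)


fig, ax = plt.subplots()
bars = ax.bar(years, means, yerr=[1.96 * s for s in std_errs],
              color='white', edgecolor='black')
ax.set_xticks(years)
cmap = plt.get_cmap('Blues')


def on_select(y_low, y_high):
    """Recolor every bar: the more of its distribution in the band, the darker."""
    for bar, mean, std_err in zip(bars, means, std_errs):
        bar.set_facecolor(cmap(band_coverage(mean, std_err, y_low, y_high)))
    fig.canvas.draw_idle()


# Keep a reference to the selector so it is not garbage collected.
selector = SpanSelector(ax, on_select, 'vertical')
```

Dragging vertically inside the axes calls `on_select` with the lower and upper ends of the band. In a plain script, a call to `plt.show()` would be needed to display the figure.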



In [1]:

    
# Use the following data for this assignment:

import pandas as pd
import numpy as np
import scipy.stats as stats
import math

np.random.seed(12345)

df = pd.DataFrame([np.random.normal(33500,150000,3650), 
                   np.random.normal(41000,90000,3650), 
                   np.random.normal(41000,120000,3650), 
                   np.random.normal(48000,55000,3650)], 
                  index=[1992,1993,1994,1995])
df









    Out[1]:

	0	1	2	3	4	5	6	7	8	9	...	3640	3641	3642	3643	3644	3645	3646	3647	3648	3649
1992	2793.851077	105341.500709	-44415.807259	-49859.545652	328367.085875	242510.874946	47436.181512	75761.922925	148853.385142	220465.210458	...	138454.070217	122488.069943	162247.982356	-273907.868554	-138410.570396	27638.756441	-33120.047151	-40989.824866	94532.974507	6128.841097
1993	-44406.485331	180815.466879	-108866.427539	-114625.083717	196807.232582	47161.295355	136522.083654	58826.904901	23329.019613	-96417.638483	...	-37809.868064	93228.910228	108183.379950	146728.060346	-10083.899508	-31300.144215	95017.857057	164071.514663	14409.944591	33298.608969
1994	134288.798913	169097.538334	337957.368420	-76005.273164	90130.207911	8453.626320	-24562.317561	195665.400438	-53475.640770	44708.230667	...	145216.405451	67773.006363	95711.194465	174500.629277	-27821.888075	-57881.583140	26321.525617	-21424.067186	60164.652898	-74750.286614
1995	-44485.202120	-156.410517	-13425.878636	53540.999558	130408.559874	20445.656224	60336.077232	60688.099156	-12748.496722	57150.175074	...	-636.804950	49707.896672	52687.564135	13529.920850	67016.324752	41664.942829	119870.946944	56946.289297	67927.466106	32839.707999
4 rows × 3650 columns



In [2]:

    
df['mean'] = df.mean(axis=1)



In [3]:

    
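z_critical = stats.norm.ppf(q=0.975)  # two-sided 95% critical value, about 1.96
# Use only the 3650 sample columns so the 'mean' column added above is excluded.
df['pop_stdev'] = df.iloc[:, :3650].std(axis=1)
# Half-width of the 95% confidence interval for each year's mean.
df['95%'] = z_critical * (df['pop_stdev'] / math.sqrt(3650))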



In [4]:

    
import matplotlib.pyplot as plt
%matplotlib notebook
df = df.reset_index()



In [45]:

    
import seaborn as sb  # not used directly in the code below
fig = plt.figure()
# Bar chart of the yearly means with 95% confidence interval error bars.
bars = plt.bar(df['index'], df['mean'],
               yerr=df['95%'],
               alpha=0.8,
               color='c',
               label='Mean')

xlabels = ["1992", "1993", "1994", "1995"]
plt.xticks(df['index'], xlabels)
plt.xlim((1991.5, 1995.5))
plt.xlabel('Year')
plt.ylabel('Mean')
plt.title('Sample Bar Chart With User Interactivity')
# Placeholder label until the user clicks on the chart.
plt.text(1991.55, 47000, 'User Input = NaN', style='italic', bbox={'facecolor':'k', 'alpha':0.2, 'pad':5})

    Out[45]:





<matplotlib.text.Text at 0x7f55ef69b828>



In [46]:

    
def onclick(event):
    # Ignore clicks outside the axes, where event.ydata is None.
    if event.ydata is None:
        return
    uinput = event.ydata

    # Redraw the whole chart from scratch on every click.
    plt.gcf().clear()
    bars = plt.bar(df['index'], df['mean'],
                   yerr=df['95%'],
                   alpha=0.8,
                   color='c',
                   label='Mean')

    xlabels = ["1992", "1993", "1994", "1995"]
    plt.xticks(df['index'], xlabels)
    plt.xlim((1991.5, 1995.5))

    plt.xlabel('Year')
    plt.ylabel('Mean')
    plt.title('Sample Bar Chart With User Interactivity')

    plt.text(1991.55, 47000, 'User Input = {:,.0f}'.format(uinput), style='italic', bbox={'facecolor':'k', 'alpha':0.2, 'pad':5})

    # Dashed horizontal line at the selected value.
    df['user'] = uinput
    plt.plot(df['index'], df['user'], '--', alpha=0.8, color='m', label='User Value')

    for i in range(len(df)):
        distance = abs(uinput - df['mean'][i])
        if distance <= df['95%'][i]:
            # The confidence interval contains the selected value.
            bars[i].set_color('w')
        else:
            # Red if the bar lies below the selected value, cyan if above.
            bars[i].set_color('r' if df['mean'][i] < uinput else 'c')
            # Fade the bar as the distance grows (52000 is a hand-picked
            # scale); clamp at 0 so alpha stays in the valid range.
            bars[i].set_alpha(max(0.0, 1 - distance / 52000))
    


cid = fig.canvas.mpl_connect('button_press_event', onclick)
